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Abstract 


We develop a latent variable model and an efficient spectral algorithm motivated 
by the recent emergence of very large data sets of chromatin marks from multiple 
human cell types. A natural model for chromatin data in one cell type is a Hidden 
Markov Model (HMM); we model the relationship between multiple cell types by 
connecting their hidden states by a fixed tree of known structure. 

The main challenge with learning parameters of such models is that iterative meth¬ 
ods such as EM are very slow, while naive spectral methods result in time and 
space complexity exponential in the number of cell types. We exploit properties 
of the tree structure of the hidden states to provide spectral algorithms that are 
more computationally efficient for current biological datasets. We provide sample 
complexity bounds for our algorithm and evaluate it experimentally on biological 
data from nine human cell types. Finally, we show that beyond our specific model, 
some of our algorithmic ideas can be applied to other graphical models. 


1 Introduction 

In this paper, we develop a latent variable model and efficient spectral algorithm motivated by the 
recent emergence of very large data sets of chromatin marks from multiple human cell types Elia. 
Chromatin marks are chemical modifications on the genome which are important in many basic 
biological processes. After standard preprocessing steps, the data consists of a binary vector (one 
bit for each chromatin mark) for each position in the genome and for each cell type. 

A natural model for chromatin data in one cell type is a Hidden Markov Model (HMM) ll^ [131 . 
for which efficient spectral algorithms are known. On biological data sets, spectral algorithms have 
been shown to have several practical advantages over maximum likelihood-based methods, including 
speed, prediction accuracy and biological interpretability ll24l . Here we extend the approach by 
modeling multiple cell types together. We model the relationships between cell types by connecting 
their hidden states by a fixed tree, the standard model in biology for relationships between cell 
types. This comparative approach leverages the information shared between the different data sets 
in a statistically unified and biologically motivated manner. 

Formally, our model is an HMM where the hidden state zt at time t has a structure represented by 
a tree graphical model of known structure. For each tree node u we can associate an individual 
hidden state that depends not only on the previous hidden state for the same tree node u but 
also on the individual hidden state of its parent node. Additionally, there is an observation variable 
a;“ for each node u, and the observation is independent of other state and observation variables 
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conditioned on the hidden state variable z“. In the bioinformatics literature, 13 studied this model 
with the additional constraint that all tree nodes share the same emission parameters. In biological 
applications, the main outputs of interest are the learned observation matrices of the HMM and a 
segmentation of the genome into regions which can be used for further studies. 

A standard approach to unsupervised learning of HMMs is the Expectation-Maximization (EM) al¬ 
gorithm. When applied to HMMs with very large state spaces, EM is very slow. A recent line of 
work on spectral learning ifTSl [T] |23] ID has produced much more computationally efficient algo¬ 
rithms for learning many graphical models under certain mild conditions, including HMMs. How¬ 
ever, a naive application of these algorithms to HMMs with large state spaces results in computa¬ 
tional complexity exponential in the size of the underlying tree. 

Here we exploit properties of the tree structure of the hidden states to provide spectral algorithms 
that are more computationally efficient for current biological datasets. This is achieved by three 
novel key ideas. Our first key idea is to show that we can treat each root-to-leaf path in the tree 
separately and learn its parameters using tensor decomposition methods. This step improves the 
running time because our trees typically have very low depth. Our second key idea is a novel tensor 
symmetrization technique that we call Skeletensor construction where we avoid constructing the full 
tensor over the entire root-to-leaf path. Instead we use carefully designed symmetrization matrices 
to reveal its range in a Skeletensor which has dimension equal to that of a single tree node. The third 
and final key idea is called Product Projections, where we exploit the independence of the emission 
matrices along the root-to-leaf path conditioned on the hidden states to avoid constructing the full 
tensors and instead construct compressed versions of the tensors of dimension equal to the number 
of hidden states, not the number of observations. Beyond our specific model, we also show that 
Product Projections can be applied to other graphical models and thus we contribute a general tool 
for developing efficient spectral algorithms. 

Einally we implement our algorithm and evaluate it on biological data from nine human cell types 
0. We compare our results with the results of Q who used a variational EM approach. We also 
compare with spectral algorithms for learning HMMs for each cell type individually to assess the 
value of the tree model. 


1.1 Related Work 

The first efficient spectral algorithm for learning HMM parameters was due to M- There has been 
an explosion of follow-up work on spectral algorithms for learning the parameters and structure of 
latent variable models ESISlia. El gives a spectral algorithm for learning an observable operator 
representation of an HMM under certain rank conditions. Il2l and 0 extend this algorithm to 
the case when the transition matrix and the observation matrix respectively are rank-deficient. E3 
extends E8l to Hidden Semi-Markov Models. 

0 gives a general spectral algorithm for learning parameters of latent variable models that have a 
multi-view structure - there is a hidden node and three or more observable nodes that are not con¬ 
nected to any other nodes and are independent conditioned on the hidden node. Many latent variable 
models have this structure, including HMMs, tree graphical models, topic models and mixture mod¬ 
els. El provides a simpler, more robust algorithm that involves decomposing a third order tensor. 
lEijisaEa provide algorithms for learning latent trees and of latent junction trees. 

Several algorithms have been designed for learning HMM parameters for chromatin modeling, in¬ 
cluding stochastic variational inference and contrastive learning of two HMMs 12^ . However, 
none of these methods extend directly to modeling multiple chromatin sequences simultaneously. 


2 The Model 

Probabilistic Model. The natural probabilistic model for a single epigenomic sequence is a hidden 
Markov model (HMM), where time corresponds to position in the sequence. The observation at 
time t is the sequence value at position t, and the hidden state at t is the regulatory function in this 
position. 
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Figure 1 ; Left; A tree T with 3 nodes V = {r,u,v}. Right; A HMM whose hidden state has 
structure T. 


In comparative epigenomics, the goal is to jointly model epigenomic sequences from multiple 
species or cell-types. This is done by an HMM with a tree-structured hidden state || 5 l(THS-HMM){^ 
where each node in the tree representing the hidden state has a corresponding observation node. 
Formally, we represent the model by a tuple % = (G, O, T, W); Figureshows a pictorial repre¬ 
sentation. 

G = (V,E) is a directed tree with known structure whose nodes represent individual cell-types 
or species. The hidden state zt and the observation xt are represented by vectors {z^} and {x^} 
indexed by nodes u G L. If {v, u) G E, then v is the parent of u, denoted by 7r(u); if u is a parent 
of u, then for all t, z^ is a parent of 2;“. In addition, the observations have the following product 
structure; if u' 7^ u, then conditioned on z“, the observation a;“ is independent of 2“ and as 
well as any zji and for t ^ t'. 

O is a set of observation matrices 0“ = P{x'^\z^) for each u G V and T is a set of transition 
tensors T“ = P(zJYi -Zt+i^) for each u G V. Finally, W is the set of initial distributions where 
Wn = for each z“. 

Given a tree structure and a number of iid observation sequences corresponding to each node of 
the tree, our goal is to determine the parameters of the underlying THS-HMM and then use these 
parameters to infer the most likely regulatory function at each position in the sequences. 

Below we use the notation D to denote the number of nodes in the tree and d to denote its depth. 
For typical epigenomic datasets, D is small to moderate ( 5 - 50 ) while d is very small (2 or 3 ) as 
it is difficult to obtain data with large d experimentally. Typically to, the number of possible val¬ 
ues assumed by the hidden state at a single node, is about 6 - 25 , while n, the number of possible 
observation values assumed by a single node is much larger (e.g. 256 in our dataset). 


Tensors. An order -3 tensor M G ® ® is a 3 -dimensional array with nin2'n^ entries, 

with its (ii, 12, iaj-th entry denoted as 

Given x 1 vectors * = 1 , 2 , 3 , their tensor product, denoted by (g) r;2 <8) V3 is the ni x n2 x ns 
tensor whose (ii, 12, Jsj-th entry is (ui)ij (1^2)12 (^3)13- A. tensor that can be expressed as the tensor 
product of a set of vectors is called a rank 1 tensor. A tensor M is symmetric if and only if for any 
permutation tt ; [ 3 ] ^ [ 3 ], = ^7r(ii),77(i2).^(*3)- 

Let M G (g)K”2 (g)]R”3^ ify. g then M(Vi, V2, V3) is a tensor of size toi x TO2 x m3, 

whose (L,t2,*3)-thentry is; M(Vl,V 2 ,V 3 )i^,^^,^^ = ^(^2)12,12(^3)33,is- 

Since a matrix is a order -2 tensor, we also use the following shorthand to denote matrix multipli¬ 
cation. Let M G (g) If Vi G M(Vi, V2) is a matrix of size toi x m2, 

whose (li,i2)-th entry is; M(Li, (^2)i2.i2- This is equivalent to 

MV2. 


’in the bioinformatics literature, this model is also known as a tree HMM. 
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Meta-States and Observations, Co-occurrence Matrices and Tensors. Given observations a;“ 
and at a single node u, we use the notation to denote their expected co-occurence frequen¬ 
cies: = ]E[a:“ ® and to denote their corresponding empirical version. The tensor 

® Xfi 0 x’^u] and its empirical version are defined similarly. 

Occasionally, we will consider the states or observations corresponding to a subset of nodes in G 
coalesced into a single meta-state or meta-observation. Given a connected subset S' C y of nodes in 
the tree G that includes the root, we use the notation zf and xf to denote the meta-state represented 
by {zi,u G S) and the meta-observation represented by {x'^,u G S) respectively. We define the 
observation matrix for S as = P(xf\zf) G R"'®'and the transition matrix for S as 
) G respectively. 

For sets of nodes Vi and V 2 , we use the notation P^l,’^^ to denote the expected co-occurrence 
frequencies of the meta-observations x'^^ and x^i^. Its empirical version is denoted by . 

Similarly, we can define the notation and its empirical version 


Background on Spectral Learning for Latent Variable Models. Recent work by IT] has pro¬ 
vided a novel elegant tensor decomposition method for learning latent variable models. Applied to 
HMMs, the main idea is to decompose a transformed version of the third order co-occurrence tensor 
of the first three observations to recover the parameters; m shows that given enough samples and 
under fairly mild conditions on the model, this provides an approximation to the globally optimal 
solution. The algorithm has three main steps. First, the third order tensor of the co-occurrences is 
symmetrized using the second order co-occurrence matrices to yield a symmetric tensor; this sym¬ 
metric tensor is then oithogonalized by a whitening transformation. Finally, the resultant symmetric 
orthogonal tensor is decomposed via the tensor power method. 

In biological applications, instead of multiple independent sequences, we have a single long se¬ 
quence in the steady state. In this case, following ideas from 1^ . we use the average over t of the 
third order co-occurence tensors of three consecutive observations starting at time t. The second 
order co-occurence tensor is also modified similarly. 


3 Algorithm 


A naive approach for learning parameters of HMMs with tree-structured hidden states is to directly 
apply the spectral method of m. Since this method ignores the structure of the hidden state, its 
running time is very high, even with optimized implementations. This motivates the 

design of more computationally efficient approaches. 

A plausible approach is to observe that at f = 1, the observations are generated by a tree graphical 
model; thus in principle one could learn the parameters of the underlying tree using existing algo¬ 
rithms Il22ll2ni25]l. However, this approach does not directly produce the HMM parameters; it also 
does not work for biological sequences because we do not have multiple independent samples at 
f = 1; instead we have a single long sequence at the steady state, and the steady state distribution 
of observations is not generated by a latent tree. Another plausible approach is to use the spectral 
junction tree algorithm of ll25l : however, this algorithm does not provide the actual transition and 
observation matrix parameters which hold important biological information, and instead provides 
an observable operator representation. 

Our main contribution is to show that we can achieve a much better running time by exploiting the 
structure of the hidden state. Our algorithm is based on three key ideas - Partitioning, Skeletensor 
Construction and Product Projections. We explain these ideas next. 


Partitioning. Our first observation is that to learn the parameters at a node u, we can focus only on 
the unique path from the root to u. Thus we partition the learning problem on the tree into separate 
learning problems on these paths. This maintains correctness as proved in the Appendix. 
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The Partitioning step reduces the computational complexity since we now need to learn an HMM 
with states and n'^ observations, instead of the naive method where we learn an HMM with 
states and observations. As d D in biological data, this gives us significant savings. 

Constructing the Skeletensor. A naive way to learn the parameters of the HMM corresponding to 
each root-to-node path is to work directly on the 0{n‘^ x n'^ x n‘^) co-occurrence tensor. Instead, 
we show that for each node it on a root-to-node path, a novel symmetrization method can be used 
to construct a much smaller skeleton tensor T“ of size n x n x n, which nevertheless captures the 
effect of the entire root-to-node path and projects it into the skeleton tensor, thus revealing the range 
of 0“. We call this the skeletensor. 

Let Hu be the path from the root to a node u, and let be the empirical x nx 

tensor of co-occurrences of the meta-observations Hu, u and Hu at times 1, 2 and 3 respectively. 
Based on the data we construct the following symmetrization matrices; 
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^2,1 \^3,1 I 


Note that Si and S '3 are n x matrices. Symmetrizing with and S3 gives us an 

nxnxn skeletensor, which can in turn be decomposed to give an estimate of 0“ (see Lemmaj^in 
the Appendix). 

Even though naively constructing the symmetrization matrices and skeletensor takes -f 

rPp time, this procedure improves computational efficiency because tensor construction is a one¬ 
time operation, while the power method which takes many iterations is carried out on a much smaller 
tensor. 

Product Projections. We further reduce the computational complexity by using a novel algo¬ 
rithmic technique that we call Product Projections. The key observation is as follows. Let 
Hu = {uq, Ml,..., Ud-i] be any root-to-node path in the tree and consider the HMM that generates 
the observations (a;“°, xp ,..., xp~^) for t = 1,2,.... Even though the individual observations 
xp ,j = 0, 1,..., d — 1 are highly dependent, the range ofO^^, the emission matrix of the HMM 
describing the path Hu, is contained in the product of the ranges ofO^^, where 0 “^ is the emission 
matrix at node Uj (Lemmaj^in the Appendix). Eurthermore, even though the 0“’ matrices are diffi¬ 
cult to find, their ranges can be determined by computing the SVDs of the observation co-occurrence 
matrices at Uj. 

Thus we can implicitly construct and store (an estimate of) the range of . This also gives us 
estimates of the range of the column spaces of Pp^'^ and Pp^'^, and the range of the 

first and third modes of the tensor Ppfp’^'^. Therefore during skeletensor construction we can 

avoid explicitly constructing Si, S3 and and instead construct their projections onto their 

ranges. This reduces the time complexity of the skeletensor construction step to 

0{NrrP‘^~^^ -f -f dmrP) (recall that the range has dimension m.) While the number of hidden 
states m could be as high as n, this is a significant gain in practice, as n 3 > m in biological datasets 
(e.g. 256 observations vs. 6 hidden states). 

Product projections are more efficient than random projections lEl on the co-occurrence matrix of 
meta-observations: the co-occurrence matrices are n'^ x nf matrices, and random projections would 
take n{np time. Also, product projections differ from the suggestion of ifTSl since we exploit 
properties of the model to efficiently find good projections. 


3.1 The Full Algorithm 

Our final algorithm follows from combining the three key ideas above. Algorithm shows how 
to recover the observation matrices 0“ at each node u. Once the 0“s are recovered, one can use 
standard techniques to recover T and W; details are described in Algorithm|^in the Appendix. 

3.2 Product Projections beyond HMMs with Tree-structured Hidden States 

The Product Projections technique is a general technique with applications beyond our model. 
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Algorithm 1 Algorithm for Observation Matrix Recovery 

1 : Input: N samples of the three consecutive observations {xi,X 2 , generated by an HMM 

with tree structured hidden state with known tree structure. 

2: for M G y do 

3: Perform SVD on P ^’2 to get the first m left singular vectors t/“. 

4: end for 
5: for M G y do 

6 : Let Hu denote the set of nodes on the unique path from root r to u. Let 

7: Construct Projected Skeletensor. First, compute symmetrization matrices: 






Hu r)Hu',Hu -fjHu ^ — 1 


8 : Compute symmetrized second and third co-occurrences for u: 

9: Orthogonalization and Tensor Decomposition. Orthogonalize using decom¬ 

pose to recover (0",..., 9^) as in m (See Algorithm|^in the Appendix for details). 

10: Undo Projection onto Range. Estimate O" as: 6" = P"0", where 0" = (0",..., 9^). 

11: end for 


Application 1: HMM with more general hidden states. Consider an HMM with a hidden state 
represented by a general graphical model G = (U, E) with an observation variable cc" corresponding 
to each u G V. xf is independent of all other hidden state and observation nodes, conditioned on its 
corresponding hidden state variable z". In this case, = iS>uevO'^. Similar graphical models 
have been used in biology to model gene expression time courses ini. 

Application 2: HMM with rank-deficient observation matrix. Consider an HMM whose obser¬ 
vation matrix O is rank-deficient. In this case, Q suggests compressing sequences of successive 
observations of size s for s = 2,3,... until the matrices = P{xt,Xt+i, ■ ■ ■ ^Xt+s-i\zt) and 
= P{xt, Xt-i, ..., Xt-s+i\zt) have rank m. A version of ifTSll is then run using observation se¬ 
quence pairs Pi:s,s-i-i: 2 s and triples Pi:^ s+i_s+ 2 : 2 s-i-i- In this case, we can show that both range(O'I^) 
and range(O^) are contained in range(0®®); we can therefore use Product Projections to improve 
the n(n®) running time to 


3.3 Performance Guarantees 

We now provide performance guarantees on our algorithm. Since learning parameters of HMMs 
and many other graphical models is NP-Hard, spectral algorithms make simplifying assumptions on 
the properties of the model generating the data. Typically these assumptions take the form of some 
conditions on the rank of certain parameter matrices. We state below the conditions needed for our 
algorithm to successfully learn parameters of a HMM with tree structured hidden states. Observe 
that we need two kinds of rank conditions - node-wise and path-wise - to ensure that we can recover 
the full set of parameters on a root-to-node path. 

Assumption 1 (Node-wise Rank Condition). For all u G V, the matrix O" has rank m, and the 
joint probability matrix P^’i has rank m. 

Assumption 2 (Path-wise Rank Condition). For any u G V, let Hu denote the path from root to u. 
Then, the joint probability matrix has rank 

Assumption!^ is required to ensure that the skeletensor can be decomposed, and that ?7“ indeed 
captures the range of O". Assumption [^ensures that the symmetrization operation succeeds. This 
kind of assumption is very standard in spectral learning nan- 
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13 has provided a spectral algorithm for learning HMMs involving fourth and higher order moments 
when Assumption[T]does not hold. We believe similar approaches will apply to our problem as well, 
and we leave this as an avenue for future work. 

If Assumptions[T]and|^hold, we can show that Algorithm[2is consistent - provided enough samples 
are available, the model parameters learnt by the algorithms are close to the true model parameters. 
A finite sample guarantee is provided in the Appendix. 

Theorem 1. [Consistency] Suppose we run Algorithm on the first three observation vectors 
{xi^i,Xi^ 2 ,Xi^ 3 } from N iid sequences generated by an HMM with tree-structured hidden states. 
Then, for all nodes u G V, the recovered estimates 0“ satisfy the following property: with high 
probability over the iid samples, there exists a permutation 11“ of the columns of 0“ such that as 
< £{N) where £{N) 0 as N ^ oo. 

Observe that the observation matrices (as well as the transition and initial probabilities) are recovered 
upto permutations of hidden states in a globally consistent manner. 


4 Experiments 

Data and experimental settings. We ran our algorithm, which we call “Spectral-Tree”, on a chro¬ 
matin dataset on human chromosome 1 from nine cell types (Hl-hESC, GM12878, HepG2, HMEC, 
HSMM, HUVEC, K562, NHEK, NHLE) from the ENCODE project Q. Eollowing Q, we used a 
biologically motivated tree structure of a star tree with Hl-hESC, the embryonic stem cell type, as 
the root. There are data for eight chromatin marks for each cell type which we preprocessed into 
binary vectors using a standard Poisson background assumption M. The chromosome is divided 
into 1,246,253 segments of length 200, following ifTTI . The observed data consists of a binary vec¬ 
tor of length eight for each segment, so the number of possible observations is the number of all 
combinations of presence or absence of the chromatin marks (i.e. n = 2® = 256). We set the 
number of hidden states, which we interpret as chromatin states, to m = 6, similar to the choice 
of ENCODE. Our goals are to discover chromatin states corresponding to biologically important 
functional elements such as promoters and enhancers, and to label each chromosome segment with 
the most probable chromatin state. 

Observe that instead of the first few observations from N iid sequences, we have a single long 
sequence in the steady state per cell type; thus, similar to ||23]| . we calculate the empirical co¬ 
occurrence matrices and tensors used in the algorithm based on two and three successive obser¬ 
vations respectively (so, more formally, instead of Pi, 2 , we use the average over t of 
so on). Additionally, we use a projection procedure similar to m for rounding negative entries in 
the recovered observation matrices. Our experiments reveal that the rank conditions appear to be 
satisfied for our dataset. 

Run time and memory usage comparisons. Eirst, we flattened the HMM with tree-structured 
hidden states into an ordinary HMM with an exponentially larger state space. Our Python imple¬ 
mentation of the spectral algorithm for HMMs of ifTSl ran out of memory while performing singular 
value decomposition on the co-occurence matrix, even using sparse matrix libraries. This suggests 
that naive application of spectral HMM is not practical for biological data. 

Next we compared the performance of Spectral-Tree to a similar model which additionally con¬ 
strained all transition and observation parameters to be the same on each branch Q. That work used 
several variational approximations to the EM algorithm and reported that SME (structured mean 
field) performed the best in their tests. Although we implemented Spectral-Tree in Matlab and did 
not optimize it for run-time efficiency, Spectral-Tree took ~2 hr, whereas the SME algorithm took 
^13 hr for 13 iterations to convergence. This suggests that spectral algorithms may be much faster 
than variational EM for our model. 

Biological interpretation of the observation matrices. Having examined the efficiency of 
Spectral-Tree, we next studied the accuracy of the learned parameters. We focused on the obser¬ 
vation matrices which hold most of the interesting biological information. Since the full observation 
matrix is very large (2® x 6 where each row is a combination of chromatin marks), Eigurej^ shows 
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the 8x6 marginal distribution of each chromatin mark conditioned on each hidden state. Spectral- 
Tree identified most of the major types of functional elements typically discovered from chromatin 
data: repressive, strong enhancer, weak enhancer, promoter, transcribed region and background state 
(states 1-6, respectively, in Figure]^). In contrast, the SMF algorithm used three out of the six states 
to model the large background state (i.e. the state with no chromatin marks). It identified repressive, 
transcribed and promoter states (states 2, 4, 5, respectively, in Figure [^) but did not identify any 
enhancer states, which are one of the most interesting classes for further biological studies. 


H3K27ac 

0.0 0.0 0.0 0.0 ^27j0.0 

H3K27me3 

0.0 0.070.02 0.0 0.04 0.01 

H3K36me3 

0.0 0.0 0.0 0.160.09 0.0 

H3K4me1 

o 

b 

o 

b 

o 

b 

o 

b 

3] 

o 

b 

H3K4me2 

0.0 0.0 0.0 0.0 0.39 0.0 

H3K4me3 

0.0 0.0 0.0 0.0 0.23 0.0 

H3K9ac 

0.0 0.0 0.0 0.0 0.23 0.0 

H4K20me1 

0.0 0.0 0.0 0.030.06 0.0 


1 2 3 4 5 6 


(a) SMF 


H3K4me10.01 
H3K4me20.02 
H3K4me30.01 
H3K9ac 


0.01 


H3K27ac 0.0^|0.13^|0.01 0.0 
H3K27me30.350.010.01 0.0 0.0 0.01 
H3K36me3lo.010.050.04 0.01 

to.o 



H4K20me10.020.02 0.01 0.010.02 0.0 
1 2 3 4 5 6 

(b) Spectral-Tree 


Figure 2: The compressed observation matrices for the GM12878 cell type estimated by the SMF 
and Spectral-Tree algorithms. The hidden states are on the X axis. 


We believe these results are due to that fact that the background state in the data set is large: ^62% 
of the segments do not have chromatin marks for any cell type. The background state has lower 
biological interest but is modeled well by the maximum likelihood approach. In contast, biologi¬ 
cally interesting states such as promoters and enhancers comprise a relatively small fraction of the 
genome. We cannot simply remove background segments to make the classes balanced because it 
would change the length distribution of the hidden states. Finally, we observed that our model esti¬ 
mated significantly different parameters for each cell type which captures different chromatin states 
(Appendix Figure 3). For example, we found enhancer states with strong H3K27ac in all cell types 
except for Hl-hESC, where both enhancer states (3 and 6) had low signal for this mark. This mark 
is known to be biologically important in these cells for distinguishing active from poised enhancers 
cni. This suggests that modeling the additional branch-specific parameters can yield interesting 
biological insights. 

Comparison of the chromosome segments labels. We computed the most probable state for each 
chromosome segment using a posterior decoding algorithm. We tested the accuracy of the predic¬ 
tions using an experimentally defined data set and compared it to SMF and the spectral algorithm for 
HMMs run for individual cell types without the tree (Spectral-HMM). Specifically we assessed pro¬ 
moter prediction accuracy (state 5 for SMF and state 4 for Spectral-Tree in Figure]^ using CAGE 
data from M which was available for six of the nine cell types. We used the El score (harmonic 
mean of precision and recall) for comparison and found that Spectral-Tree was much more accurate 
than SME for all six cell types (Table [T]|. This was because the promoter predictions of SME were 
biased towards the background state so those predictions had slightly higher recall but much lower 
specificity. 

Einally, we compared our predictions to Spectral-HMM to assess the value of the tree model. Hl- 
hESC is the root node so Spectral-HMM and Spectral-Tree have the same model and obtain the 
same accuracy (Table[^. Spectral-Tree predicts promoters more accurately than Spectral-HMM for 
all other cell types except HepG2. However, HepG2 is the most diverged from the root among the 
cell types based on the Hamming distance between the chromatin marks. We hypothesize that for 
HepG2, the tree is not a good model which slightly reduces the prediction accuracy. 

Our experiments show that Spectral-Tree has improved computational efficiency, biological inter- 
pretability and prediction accuracy on an experimentally-defined feature compared to variational 
EM for a similar tree HMM model and a spectral algorithm for single HMMs. A previous study 
showed improvements for spectral learning of single HMMs over the EM algorithm ll24ll . Thus our 
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Cell type 

SME 

Spectral-HMM 

Spectral-Tree 

Hl-hESC 

.0273 

.1930 

.1930 

GM12878 

.0220 

.1230 

.1703 

HepG2 

.0274 

.1022 

.0993 

HUVEC 

.0275 

.1221 

.1621 

K562 

.0255 

.0964 

.1966 

NHEK 

.0287 

.1528 

.1719 


Table 1: FI score for predicting promoters for six cell types. The highest FI score for each cell type 
is emphasized in bold. Ground-truth labels for the other 3 cell-types are currently unavailable. 


algorithms may be useful to the bioinformatics community in analyzing the large-scale chromatin 
data sets currently being produced. 
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A Recovering the Transition Probabilities and Initial Probabilities 

AlgorithrnJ^recovers the transition and initial probabilities, given estimates of observation matrices. 
Theorem [provides finite sample guarantees on Algorithm[2in conjunction with Algorithmic 


Algorithm 2 Recovering the Transition Probabilities and Initial Probabilities 
1: Input: N samples of the hrst three observations generated by a tree HMM, 

Estimates of observation matrices 0“. 

2: for M G C do 

3: if u is root r then 

4: Compute . 

5: Compute . 

6: Normalize over the coordinate to get T“. 

7: else 

8: Compute PC" = (0“)tp"f 

9: Compute Q“ = 

10: Normalize over the Zn coordinate to get T“. 

11: end if 

12: end for 


B Additional Notations 

For a node u G V, when it is clear from context, we sometimes use H to denote 7T„ and d to denote 
du- 

Define O 2 to be a x rrP matrix whose rows are indexed by elements in [np and columns are 
indexed by elements in In particular, (02^)(zi,...,irf),(ji,... = P[x 2 = ii,... ,X 2 = = 

3i,---,Z 2 = jd)- Similarly we define Of whose entries are = P{xl = 

ii,...,x^ = = ji, • ■ • ,^2 = id), and Of whose its entries are (Of = 

P{xi = ii,... ,Xi = id\z 2 = jiT ■ ■ T ^2 — jd)- We define O 2 to be a n x m‘^ matrix, whose 
rows are indexed by elements in [n], and columns are indexed by elements in \mp. Its entries are 

(O 2 = Pi.xi = *14 = ii, ■ • ■, 4 = jd)- 
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Define to be a vector representing the marginal probability of (z 2 ,..., particular, its rows 

are indexed by elements in [ 771 ]“^, and = P(z 2 = ii,Z 2 = id). Define 7 r“ to be a vector 

representing the marginal probability of Z 2 - In particular, its rows are indexed by elements in [m], 
and 7r“ = P{z 2 = *)■ Define as min^ 7r“. Define as the m'^ dimensional vector representing 
the marginal probability of (z[,..., 2 :“) whose entries are indexed by elements in In particular, 
= P(zi = ii,..., 2 ;“ = id). is defined as the mf' x mf' matrix representing the 
conditional probability of Z 2 given z^ , and its rows and columns are indexed by elements in [mY‘, 
in particular, = -^(4 = *i,- ■ - = ji,- ■ •, zj' = jd). 

Let M be a node in V. Define C/“ to be a matrix whose columns form an orthonormal basis of 0“. 
One way to get C7“ is to take its columns to be the top m singular vectors of 0“. The specific choice 
of C/“ does not affect our analysis, as we will be only looking at the projection matrix 
throughout. Define [/^ to be 

For a matrix M, define ||M|| to be its operator norm, that is, max||„||^i_||„||^i ||u^Mu|j. Define 
the Frobenius norm of M, ||M|jp’ to be square root of the sum of the square of its entries, that is, 

^ij- standard results in linear algebra, ||M|| < ||M||i;’. Similarly, for a third order 

tensor T, define ||T|| to be its operator norm, that is max||„||^i ||„||=i ||u,||=i T(u, u, tu). Define 
the Frobenius norm of T, ||T|jF to be square root of the sum of the square of its entries, that is, 

standard results of linear algebra, ||r|| < ||r||i^. 

C Main Lemmas 

C.l Partitioning Lemmas 

Lemma 1 (Path Partitioning). Suppose observations and states {x'l,z^'\y^v,t^'H drawn from 
a THS-HMM represented by % = (G, T, 0,kF), where G = {V,E), T = {Ty,v G V}, O = 
{Oy,v G V}, W = {Wy,v G V'\. Let u G V, and let denote nodes inside the unique path 
from root r to u. Then {x'l, generated by a THS-HMM represented by a tuple 

H = (G,T, 0,kF), where G = {V,E) is the induced subgraph on In particular, V = 

E = {(u,7r(t-))}„6ffj, f = {n,vG Hu}, 6 = {Ou,v G H^}, W = {FF.,u G iF„}. 

Proof of Lemmaflj To show this lemma, we will calculate the marginal distribution of the variables 
{x}, Observe that the full joint distribution of {x}, is equal to: 

r— 1 r 

v^G t—lv^G 


To calculate the marginal over {x}, z'}}y^Hu,t&[T\y we eliminate the rest of the variables one by one. 
Observe that we can eliminate any observation variable x} for v f. H^ without introducing any extra 
edges, as x} is only connected to z'}. Moreover, marginalizing x} gives: Pr(a;( = x\z} = z) = 

1 . 

Let G be the current tree; initially G = G. We next eliminate the nodes {z},t = t, ... ,1} for 
V Hu one by one where v f. Hu is a leaf node in G. We do this in the order zf, ..., z"; 

once we have eliminated these nodes, we delete v from G, and we continue until only the nodes in 
Hu are left. To eliminate a z^ when {z^, s > t} have been eliminated, we sum over: = 

z|zj_ 2 , which also sums to 1. 

We repeat this process until only the nodes {x}, z"}}u^Hu,t^[T\ left- Since we get 1 from elimi¬ 

nating each variable, the marginal we are left with is: 


T-l 


n n n 


i+ll 


t(i;) 


)n n 


vGHu 


t^l vGHu 


vGHu 


( 1 ) 


11 




which is the marginal distribution of an HMM with tree-structured hidden states described by the 
tuple (G, f, O, W). The lemma follows. □ 


The following is a Corollary of Lemma[T] 

Corollary 1. If observations and states {x'l, (ire drawn from a THS-HMM represented 

by (G, T, O, W), then the sequence of coalesced observations and states }tgN ore drawn 

from an HMM. 

Proof. The proof is a simple extension of Lemma pi Q gives us the marginal distribution of 
{ajj, Observe that for any t, conditioned on x^ " is d-separated from all the 

other nodes of the graph - this is because for any node x in the graphical model, and x 

either form a chain or or a fork structure whose middle node is Zf . Moreover, conditioned on z^ , 
z^ff is d-separated from the set of nodes This is because 27 “, z^ “ and z^ff form a 

chain structure whose middle node is . The lemma thus follows. □ 

C.2 Skeletensor Lemmas 

In this subsection, we justify our construction of a skeletensor. Let u be any node in the tree G and 
let H be the path from the root of G to u. 

Recall that we define Of^ to be the n‘^ x m‘^ matrix, whose entries are = 

P(x^ = ii,... ,x'I = id\zo = ji,... ,Zn = jd). Similarly, O? is a x m‘^ matrix, with entries 

{oth, .,-5=..14=j.). 

We begin by showing that under Assumptions[^and|^ the matrices and for the three-view 
mixture model induced by the HMM have full column rank. 

Lemma 2. Let u be a node in V. Recall that H = is the set of nodes along the path from root r 
to u. Then: 

(1) The matrices diag{p^){T^)^ diag{'K^)~^ and are of full rank. 

(2) The matrices and are of full column rank. 

Proof. By Lemma[^ x^ , x^ , x^ are conditionally independent given . Thus, 

Pif = Of diag( 7 r")(Of)^ 

Since by Assumption 2 is of rank m'^, this implies that the matrix Of must be of rank as 

well. By Proposition t/ of^H, 

Of = 0"diag(p^)(T^)^diag(7r")-i 

This implies that diag(p^)(r^)^diag( 7 r^)“^ is of rank m'^, which is of full rank. Hence is of 
full rank. By Proposition 4.2 of 12, 

of = 

This shows Of is of full column rank. □ 

Second, we discuss the infinite sample version of our symmetrization matrix. This will be extended 
in Lemmaj^in our detailed finite sample analysis. 

Lemma 3. Let u be a node in V. Recall that Hu is the set of nodes along the path from root r to u. 
Assume T’f are given (where = {Pii^)'^)- Lef the symmetrization matrices 

be: 

QU _ t^u,H / 

*^1 — ^ 2,3 ) 

s- = pffiP,Y)^ 

and the ground truth symmetrized pair-wise and triple-wise co-occurence tensors be: 

M^ = P^f{Sfj) 

Mf=P^£f(Sf,I,Sf) 
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Then, 

i 

MS = E 0 (0“)i 0 (0“), 

i 


Proof. By Lemma[^ , x^, x^ are conditionally independent given z|^, thus 


P^f = 02diag(7r")0^ 


H\r)HT 


= of diag( 7 r^) 03 ^^ 

Lemma ^implies that Of is of full column rank, and diag(7r^)0|^^ is of full row rank. Therefore 
by standard properties of pseudoinverse, 

«3^)t = (diag(.^)03^^)t(0f )t 

Therefore, 

Likewise, 

Then, 


Mf = 



= 02 (Of 


^ 3 “ = 

M“ = 


= 

E <...dni02K...dn^iOSfu...dn 

ii 

= 


= 

(0“), 

i 

jyH,,u^H / quT t quT\ 

— ^1,2,3 Wl 7-^7*^3 ) 

II 

IM 

^L.dn(0^)n,...dn ® (OShu...dn ® (O^hd 

= E 





□ 


C.3 Product Projections Lemmas 

C.3.1 Product Projections in HMM with Tree Hidden States 

Lemma 4. O^, the observation matrix of the HMM that generates the meta-states and meta¬ 
observations {z^, xf }tgN> equals O”. 

Proof We consider the observation matrix of the HMM that generates the meta-states and meta¬ 
observations The number of possible meta-hidden states is mf, indexed by 

{z^)y^H and the number of possible meta-observations xf is vf, indexed by {x^)v^h- Thus, the 
observation matrix is of dimension x mf. Entrywise, 

= V{xl = ii,...,xS = id\zl =jd) 

~ ■ ■ ■ ^id,jd 

= ((^ O 

v&H 
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Where the second equality uses conditional independence. Therefore, ^ 

C.3.2 Product Projections Beyond HMM with Tree Hidden States 

We consider the case of a simple HMM when the observation matrix O is not full rank. In this case, 
we first define forward and backward observation matrices Of and Of formally. For a fixed s, Of 
is a X TO matrix, with rows indexed by a s-tuple (ji, ..., jg) G [n]®, and columns indexed by 
i G [to]. Entry wise, 

= * 2 , ■ • ■,xt+s-i = is\zt = j) 

Similarly we define backward observation matrices Of = P{xt, Xt-i ,..., Xt-s+i\zt)- Entrywise, 

(^s P{,Xi , Xi — i g-|-l tgjZi j) 

The claim is the range of the forward(backward) observation matrices is contained in the range of 
the s-wise Kronecker product of the original observation matrices. 

Lemma 5. 

range{Ol) C range{0^^) 
range(Og) C range{0®’^) 

Proof. We prove the first relationship, since the proof of the second is almost identical. 

Note that by the law of total probability, 

(^s )(n,12,i 

= P{xt = il,Xt+l = ■*2, • ■ ■,Xt+s-l = is\zt = j) 

= ^ P{xt =il,Xt+i =i2,...,Xt+s-l =is\zt = j,Zt+i = j 2 ,...,Zt+s-l = js) 

32, ■■■Js 

xP{zt+l = J 2 ■ ■ -^Zt+s-l = js\zt = j) 

^ ■ ■ ■ ^is,jsPi^t+l J 2 ■ ■ • ; -^t+s —1 js\zt j) 

32.,--,3s 

~ ^ ' (O® ■ ■ ■ ) ^t+s —1 “ Jsj^t = j) 

32, ■■■,3s 

Thus, each column of Of is a linear combination of the columns of O®*, thus completing the proof. 

□ 


D Finite Sample Guarantees 

Theorem 2 (Accuracy of Initial Distribution and Transition Probabilities). There exists a universal 
constant C such that the following hold. Suppose Algorithm^is given as input N iid observation 
triples {xii,Xi 2 ,Xif}f^.y generated by a THS-HMM, and outputs estimates of observaton matrices 
O^, for each node u in the tree. Then Algorithm^is run on the same sample and has {0“}ugv as 
input. If the size of sample N is greater than: 


C max I 


, D 
-^ 5 -^ In —, 


CT^CT 


2^3 


m 

~2 ' 

crfa 


i„2. 


m 


1^2 


-)U,U 


D m 
^ ’ 2 s' 

d (72^x6 


m 


, D 

’ .x62t14,.4 


In^ 




where CTi = min„gv tT„(0“), CT 2 = min^gy am{Pi’ 2 )> 0-3 = niin„gy cr^d{P^fP") and 7r„iin = 
min„ i 7r“, then with probability > 1 — (5 over the training examples, with probability 0.9 over the 
random initializations in Algorithm^ there exist permutation matrices {n“}„gy such that for all 
uGV, 

||0“ - (d“n“)|| < e 

ifu is the root node, then, 

||IE“ - (n“)^IE“|| < e 
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Otherwise, 


\\w^ - < e 

IIQ" - Q“(n“,n“,n’"(“))|| < e 

We emphasize that our algorithm recovers the initial probability and transition probability tensors 
up to permutations of hidden states in a globally consistent manner. In contrast to EOl where some 
hidden nodes do not have observations directly associated with them, in our setting, each hidden 
state has an associated observation, which makes recovery of permutations easier. How to perform 
parameter recovery in a THS-HMM with internal hidden states where each hidden tree node does 
not have an associated observation is an interesting question for future work. 


E Proofs 


Throughout this section, we first assume a technical condition on the sample size. This will result in 
concentration of the projection and the symmetrization matrices. 

Assumption 3. Recall that D = \ V\. The sample size N is large enough that 


< min 


e{N,S) 

• / T~%u,u\ • 

mWlu^V (^ni[Pi'2 ) 




16D 

mm„gy cr^(P“ 2 ") min„gv crm(O^) min„gy min^gy (Tm(0“)^7r; 


33/2 , 


4:^/m 

3/2 Q 
f O'2(T3 0-20-1 7r^inO-iO-3 \ 


1536cito 


( 2 ) 


V 16D ’ 4i/m’ 1536 cito / 

Where Ci > 0 is a constant given in Lemma 11 and CTi, 0 - 2 , 0-3 and Tr^ju are defined in Theorem^ 


E.l Raw Moments Concentration 

We start with standard concentration of raw moments, which uses the fact that all the (vectorized) 
raw moments can be viewed as a probability vector. Let u be a node in V, recall that H is the set of 
nodes along the path from root r to u. 

Let e(iV, 5) = . Define event 

P = { for all M G L : ||P“^“ - Pi“ 2 “||f < e(W, <5) 

\\Pif-Pyn\F<e{N,5) 

\\P 2 f -P 2 f\\F< e{N, 5) 

\\P^f -P^f\\F< e{N, 5) 

\\Pyi':f -Piif\\F< e{N, 5) 
\\Plf-PnF<e{N,5) 

\\Pi:P-Pi:nF<<N,5) 

||P/f(“) - P“f^“^||i. < e{N,5) 

^^puMu),u _ puMu),u^^^ 


Lemma 6 (Concentration of Raw Moments). P(P) > 1 — i5. 
Proof. Applying Proposition 19 in ifTSl along with union bound. 


□ 













E.2 Subspace Concentration 


Next we state a useful lemma that says that conditioned on the event E, performing an SVD on the 
empirical version of = E[a;“ ® X 2 ] gives us a good approximation to the range of Recall 
that [/“ is a matrix whose columns form an orthonormal basis of 0“, and define is 
Also, recall for a matrix U with orthonormal columns, the projection matrix onto range({7) is UU^. 


Lemma 7 (Subspace Concentration). Smposes N is large enough such that Assumption^holds. 
[/“ is the output of line 3 of Algorithm^ Let u be a node in V, recall that H is the set of nodes 
along the path from root r to u. Then conditioned on event E, we have: 

(1) ||C/“(C/“)^ - 

In particular. 


T - - T 1 

||[/“([ 7 “)T_t/«(; 7 «)T|| < jnin(- 


SD 


2 y/rn 


( 2 ) 

(3) 


3 ) 


£/"([/")' -[/"([/")' II < 


1,3 






2 


Proof (1) the matrix of principal angles between range((7“) and range([/“), is such that 


< 


< 


||sin$“|| 

e{N,S) 

2e{N,S) 


(3) 


where the first inequality is by Theorem 4 by taking A = Pf'f and A = Pf'f ; the second inequality 
from Assumption!^ which implies that^A^, 6) < am{P\ 2 ') 12. 

Thus, by Equation (|^ in Assumption]^ 


I . ^ ■ min„gy cr„(P3^3"’-^“) min^gy (7^(0“) 

|sin$ ||<min(-—-,-—=-) 


8£i 


2y/rn 


The result follows from the fact that 


II sin$“|| = ||(7“(C/“)^ - (7“([7“)^|| 


(2) First we enumerate the nodes in Hu '■ Hu = {ui,..., u;}. 

||yff(jjff)T_yH(^ff)T|i 

< II (C/“1 0 ... 0 (U*" ([/"' )^) III + ■ • ■ + II ® {U'’^ (^"0^)11 

2e(A^, 5) 

^ min„gy 3 ) 

“ 8 

where the first inequality is by triangle inequality, the second inequality uses standard facts about 
Kronecker product (||A0i3|| = ||A||||f3|j), the third inequality is from Equation ([^, the fourth 
inequality is from Equation 
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(3) By item (1) we know that 

\\U^{U^V - i7“(i7“)^|! < a^(0“)/(2V^) 

Hence 

||C7“(i7")TO“ - C/"(/7“)^0“|| < ||C/“(C/“)^ - i7“(l7“)^|||jO“|| < o-„(0“)/2 

where the second inequality is from the fact that 0“ is a column stochastic matrix, which implies 
that ||0“|| < |10 “||f < Vm. 

Therefore by Theorem 

> a„(0“)/2 

□ 


E.3 Symmetrized Moment Concentration 

Lemma 8. Suppose we are given a set of matrices 17“, u G V such that is invertible for 

all u G V. Moreover, assume the expected second order moments iPii i P^'i ’ third 
order moments are given. Consider the symmetrization matrices: 

= {{U^V P2fU^)iiU^V Pyfu^)~^ 

= {{U^y P^fu^){{U^y pp^u^)-^ 

and the ground truth symmetrized second order and third order cooccurence matrices be: 

Mf = p^f{ij^{s'iy,u^) 

Mf = Pi£i^{S'^v, U'^, IJ^Sf) 

Then, 

Mf = ^<((17“)^0“),(g) ((17“)^0“), 

i 


Proof Recall that by Lemma|^ 

Of = 0^diag(p^)(T^)^diag(7r^)-i 
where diag(p^)(r^)^diag(7r'^)“^ is invertible. Thus, 


(17")^Of = (f/^)^O^diag(p^)(r^)^diag(7r^)-i 


This shows that (?7^)^Of is invertible. 

On the other hand, 

of = 

where is invertible. Thus, 

(f/")^Of = {U^)^O^T^ 


This shows that {U^)^0^ is invertible. 

Therefore, 

= ((17“)^0“diag(7r^)0f'^17^)((17")^0fdiag(7r^)0f^0^)-i 
= ((t>“r02“)((17")^Of)-i 
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Likewise, 


S'^ 

= aun^0^dmgi7r^)0rU^)iiU^V0^dmgin^)0rU^)-^ 

= iiun^omu^vo^)-^ 

Then, 

21,...,2D 

= E <...,^niiu-Von^n<^{iu^Von^n 

21,...,2D 


M" 


Pi%^{U^{S^V,U^, U^Sf) 


y 




((t/“) ' 02“)z 


2 )l\,...,lD 


® ((P“)^o“) 


2l,...,2£) 


E <,...yyvo-),, ® {iuyo-\, ® [ipyo-),, 

n,•••,*£) 


E 7r“((?7“)^0“), 0 ((P“)^0“)i 0 ((i7“)^0“). 


□ 


We next establish a result that shows that the symmetrization matrices Sf and obtained in Line 
7 of Algorithm [^concentrate to 5“ and S'^ defined in Lemmaj^ Recall from Algorithm that; 


s't = iiuyp2,y 






p"“) 


5“ = 




Lemma 9. Suppose N is large enough that Assumption^holds. Recall Si and are the outputs 
of line 7 in Algorithm^ 

, and Si and Sf are defined in Leinma^ 

Conditioned on event E, the following hold for all u G V. 


kr-kiMi53“-4“ii < 


10e(A^, 6) 

ikriukriuka iui4"ii <— 

^m'^[Pl,3 ) 


Proof (1) We first show that a-^d{{U^)^P^^^U^) > 3a^d{P^^^)/A, and 

am4iU^VPiyU^)>a^d{PiY)/2. 

Under Assumption]^ by Item (2) of Lemma]^ we know that 


Wu^iu^y 


U^{U^)^\\ 


< min a^d 
u&V 


{PiT)/8 


(4) 
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As a result, 




< \\{tj^{U^y - U^{U^)^)Pl^fU^{U^)^\\ 
+\\U^{U^)^P^i^{U^{U^)^ - U^{U^)^)\\ 

< \\{u^{u^V-u^{u^V)\mT\\\\u^iu^V\\ 
+\\u^{u^V\\\\p,Y\m^iu^V - u^iu^V)\\ 


(5) 


< amiP,T)/^ + ^rn{Pr,r)/^ 

< ^n^(^lT)/4 (6) 

where the first inequality is by triangle inequality, in the second inequality we use the fact that 
||A • B\\ < ||A||||i3||, the third inequality is from the fact that ||-Pi^ 3 ^|| < II^/s^IIf < 
||^F(^F)T|| ^ ^ 1 and Equation 

Therefore, 


<7^40^)^ p,yu^) 

tH ifTH\T r>H,HfjH 




> {P0^) - II t/" (t/«) ' P^f [/" (t/") ' - P/:3 


F)T) 

T jti ( 


) ) 




> 3cr^ 




(7) 


where the first inequality is by Theorem]^ the second inequality is by Equation]^ 
In the meantime. 


Wiu^Y p0^u^ - (^")' 

< e{,N,5)<at{P^f)/A 


H\T fyH.H fjH \ 


( 8 ) 


where in the first inequality we use the fact that || ^7^ || = 1 , the second inequality is by the fact that 


{N, S), the third inequality follows from Assumptionj^ 


ifE; happens, 11 Pi 3 - P^ 3 || < 

Therefore 

a^4{U0^PiYu^) 

> ar^400^PiTu0 - \\00^PiTu^ - 00^PiTu^\\ 

> ^m{Pif)l2 

where the first inequality is from Theorem]^ the second inequality is from Equation (|^. 
We now have 


\\0-s^\ 


= ll((C7“) 


P^fu^){{U^)' Pj 


1,3 


p^fu^){{u^Yp0^u^) 


< ii((t^“)' {P20 - P20)u^m0^p0^u^)-^^ 




< 


< 


< 


H\T TjH,HfTH\-l\ 


wu^ip^f - p^fw^iwm^)' PiTu^) 


.,H -nU,H\TTH\\\\^ frT^\ \ Jdp 

+m^yp2fu^\\\m0^Pifu^)-^ - 

2e{N,S) 8eiN,S) 


HM 


O'm'll-' 1,3 ) 

10e(Af, 5) 


f uH.H\ 2 
(^mpPl,3 r 


PPiT) 


(9) 
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In the derivation above, the first inequality uses triangle inequality and the second inequality repeat¬ 
edly uses the fact that ||A-_B|| < ||A||||_B||. The third inequality is obtained by bounding each term 
individually as follows: 


< 

< 


II m ' (p^T - ^2T II < II^2T - ^2T II < II^2T - Pi" 11^ < 

jHj'jH ^ — 1II 1 /^ \~r /'jH ^ ^ o /^ \ 


ll((t^^ 

II 

ll((^^ 






< llA“f II < 


2,3 


F ^ 


< 1 


- {{U ^)' 


jjH,H fjH\ 

3^1 

H\T oF,H fjH— 

^ 11 rt-..- /11 ^ ^T'tH \T t)H,H j jH \ — 11 


2|!(P") ' {P^f - P^f)U^\\ max(||((P")' P^fU^)-% \m^VPlfU^)-^' 

8e{N, 6) 


^{PiTr 


where the last inequality follows from Theorem]^ 


a^d[P^^ ) 


The bound of H^g — S'g || is handled similarly. 

(2) First, 

ii^rii < ii(?7“)^P2“f [/^iiii((p^)^P3'"i"p^)-^ii < 

1,3 

where the first inequality is by the fact that H^-PH < ||2l||||P|j, the second inequality is by Equa¬ 
tion (|7]i. 

Meanwhile, Assumptionj^implies < ^m'^i.Pil^)/^^ therefore from Equation (|^, 

ii^r-5rii< ^ 


/ pH,H^ 


Hence by triangle inequality. 


i^rii < 


f^m'*(Pl,3 ) 


The bounds of H^g || and \\S^\\ are handled similarly. 


□ 


Built upon the previous two lemmas, we next provide a result regarding the concentration of sym¬ 
metrized moments. 

Lemma 10. Suppose N is large enough that As sumption^holds. Let u be a node in V. Then on 
the event E, the following hold. 


||M“ 




< 


14e(fV,^) 

t r>H,H-.2 


IIM" 


M"!! < 


96e(A^, 6) 

TrfHfPE 

C^m4^1,3 ) 


Proof (1) Define P“ = P“) and P“ = P“). Then, 

\\Mf-Mf\\ 

= \\p-{Csf)^,i)-p-{{s'iY,i)\\ 

< i!(p“ - p“)((5rr,/)ii + iii^“(W)^ - is^v,i)\\ 

< ||P“-P“||||5rii + iiP“iiii5r-5rii 

^ 4e{N,S) ^ 10e{N,S) ^ 14e(iV,,5) 

<Trad{Plf) a^^PlYl ~ <^m4PlYl 
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where the first inequality is by triangle inequality, the second inequality is by the fact that 


\\M[A,B)\\ < ||M|| II A|| |j_B|j, the third inequality is from the fact that ||P 
ATlI < WPi^i'^-PifWF < eiN,S) and ||P “|1 < |lP/"ni < llA^H^ < 


“ - -P“ll < II- 

1, and Lemma 1^ 


As a result. 


< 

< 


||m“-M2“II 

II (p“( iV + p“((A)^, ^))/2 - (P“(A, iV + ^))/2|| 

l|p“((^r A, - P“((AA, ^)^ll/2 + ||P“( Ar A, - P“((AA, J')ll/2 

14e(Af,(5) 


where the first inequality follows from triangle inequality, the second inequality is from Equa¬ 
tion ( [T0| i. 


(2) Define T" = {U^, U^, U^) and f“ = {U^, P“, U^). Then, 


rpU 

_ /fjH 

~ '^1,2,3 ’ 


11^3 -^3“ll 

= 

I|t“((A“A,t, 

< 

||r“-p“||||5| 

< 

16e(Ar,<5) 


/ tjH,H^2 

CTm4Pl.3 ) 

< 

96e{N,d) 


^m4Pi.i r 


+ ■ 


+ ■ 


10e(Ar,<5) 


where the first inequality is from triangle inequality, and the fact that ||T(A, P,C)|| < 


||r|| II A|| ||P|| ||C||, the second inequality is by the fact that ||r” — T“|| < ||PiA^’^ ~ — 

llPiA^’^ ~ Pi^ 23 ^\\f — ^)’ I A” II — llAA'a^ll — 1’ Lemmaj^ the third inequality is by 

algebra. □ 


E.4 Accucary of Tensor Decomposition 


Algorithm 3 A Procedure That Finds Symmetric Decomposition based on Second and Third Order 
Moments 


1 : Input: number of components m, perturbed version M 2 and M 3 of matrix M 2 and tensor M 3 
satisfying M 2 = YJiLi ® M 3 = YhLi 

2 : Output: {^i}™ j, estimate of {0i}™ ^ 

3: Whiten. Perform an SVD on M 2 = UDU^, and let W = (where Um is matrix that 

contains the first m columns of U, Dm is the diagonal matrix with P’s first m diagonal entries), 
letG = M 3 {W,W,W). 

4: Decompose Tensor. Apply robust tensor power iteration algorithm in IT] with input G to get 


{Dl, . . .,Vm} 

5: for f = 1,2,..., m do 


6 : 


Let Zi = 


T{Vi,Vi,Vi) ' 

7: Recover 6 i = 

8 : end for 


In this section, we introduce a lemma that is implicit in Ql regarding using orthogonal decomposi¬ 
tion as a subprocedure for full rank symmetric tensor decomposition. (See Theorem 5.1 of 11.) For 
completeness, we include the proof here. 
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Lemma 11. There are universal constants Ci, C 2 such that the following holds. Suppose a matrix 
M 2 and a tensor M 3 has the following structure: 

m 

M 2 = 0 9^ 

2 = 1 


M 3 = ^ 7119 ^ ® 9 ^® 9 ^ 


2=1 

where tti > Ofor all i. And we are given their perturbed version M 2 and M 3 , such that 

IIM 2 -M 2 II <Ep 
IIM3 — M3II < Et 

Ep < tTm(0)^7rniin/2 

Ep 


where 


ci(- 


< i 


( 11 ) 

( 12 ) 


'o-m(0)3 crm(0)^ rn 

min 

where 0 = {9i,... 6m) and Tr^in = mini Then the outputs {0i}™ 3 of Algorithn^^on input M 2 
and M 3 satisfies the following. With appropriate setting of parameters (with respect to parameter 
rj), with probability 1 — t], there is a permutation a : [m] —^ [to] such that 

\\Q a 11^ f^i( 0 )^ Ep , Et ^ 

\% 0.wl|_C2 (^^(0)2 +^^( 0 ) 3 ) 


Proof. 1. We first put 0 into canonical forms by appropriate scaling of its columns. Let 0 = 
(dim .. , dm) = 0diag(7r) 2 , we have 


M2 = ^0, 
2 = 1 
m - 

M3 = ^ — 

. /TT 


2 = 1 

Recall that W is defined as UmDm^, where M 2 = UDU^. Hence H^^M 2 l^ = Im- Suppose that 
M 2 W has the following eigendecomposition: 

W^M2W = AAA^ 

Then let W = WAA~^AA, W is one of the matrices such that M 2 W = Im- Define M = 
W^Q.M = W^Q. 

2. If Equation ( [TT| l holds, then Ep < (Tm(0)^7rniin/2 < am{M 2 )/2, then we have the following; 

IH^IMII^II < 

CTm(0) 

|H"^IU|W"^|| <3(Ti(0) 


|00t _ WW(\ < 


O’m(0)^ 

4:Ep 


.(0) 


|M||,||M|| <2 
|M- M|| < 


E. 


CTmiO)^ 
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3. Define G = M^iW, W, W) = ®Mi® M„ and recall that G = M^{W, W, W). We 


have the following perturbation bound for G. Define R to be diagnoal tensor ® &i® Ci. 

Note that ||i?|| < ^— . Therefore, 

V'TTmin 

IIG-GII 

= \\M^{W,W,W) - M:i{W ,W ,W)\\ 

< M:i){W,W,W)\\ + \\M:i{W-W,W,W)\\ + \\M:i{W,W-W,W)\\ + \\M:i{W,W,W-W)\\ 

= IKMg -M3)(fT,W^,fT)|| + \\R{M - M,M,M)\\ + M - M, M)|| + M, M - M)|| 

< IIM3 - MMWf + \\R\\\\M - MllliMf + \\R\\\\M\\\\M\\\\M - M\\ + |li?||||Mf ||M - M\\ 

< := E (13) 

O-m(0)^ V^minCTm(0)^ 

where the first inequality is by triangle inequality, the second inequality is by the fact that 
\\T{A, B,C)\\ < IITIIII A|| ||i?|| ||G|j, the third inequality is from results of our step 2 and the fact 
that II11 ^ E'j'. 

4. If Equation ( fT2| ) holds, then E < ^ < Ci - for Gi required by Theorem 5.1 in IT]. 

Thus, applying robust tensor power algorithm in HI, with probability at least 1 — p, there exist a 
permutation a : [m] —>■ [m] such that 


11^* - ^<t(*)II < SyErlE 


(14) 


5. We conclude by providing the reconstruction error bound. For notational simplicity, assume ct( ) 
is identity mapping. Define 



and recall that 


1 


1 


M3(Wvi,Wvi,Wvi) G{vi,Vi,Vi) 


The recovery formula is 


{W^Vv^ 


First, I ^ ~ ^ I can be bounded as follows: 


1 1 



where the first inequality is by triangle inequality, the second inequality is by the fact that ||>1 • i3|| < 
II A|| ||i?||, the third inequality is by Equation ( [T3| ) in step 3 and Equation ( [l4| l in step 4, the fourth 
inequality is by algebra. 
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Then the reconstruction error can be bounded as follows: 


< 


< 


< 


< 


< 


< 


110 - 




II, _ _ ^ IIE!a|_±)|| + II<2:1^11 + 11(1 - ‘ „r..,i 

||00t _ WW^\m\\ + ^||M, - u,|| + + 1^ - i 

liee. - w.||^ + K^iim. - .,1 + , T - ‘ |pr.|| 

y'^min Ai 

^Ep ai(0) , 


O'm(0)^ •\/’’’min 

46cri(0)^ d,ET 


.{QY 




UEp 


fTl(0), Ep Ep , 

^min <^m{<dY CTmi&Y 


Wher the first inequality is by triangle inequality, the second inequality we use the fact that Ij A-B\\ < 
||A||||i?|| and the fact that Mi — OiyELi, Zi = ypEl, 00 ^' 0 i = 6i, WW"^ = {W^'W^, the 
third inequality uses the fact that || 0 i|| = || 0 ei||/yTiy^ < cri( 0 )/yTiy^ and ||{;i|] = 1 , in the 
fourth inequality we use results in item 2 and item 4, the fifth inequality is from the definiton of 
E and algebra, in the sixth inequality we use the fact that tTm(0) < and letting 

C 2 = 552. □ 


Now we apply the above lemma into our symmetrized cooccurence matrices M 2 and M 3 . 

Corollary 2. Suppose N is large enough such that Assumption holds. Then, on event E, with 
probability 0.9 over the randomization of D calls of Algorithni^ for all u G V, the matrices 
0“ = (0“,..., obtained at the end of line 9 are such that there exists a permutation matrix 11“, 


||(17“)^0“ 


0“n“|| < 2 c2 


m i{N,S) 

('^minY (T„d(P]^3^)3cr^(0“)3 


Proof By Assumption]^ we first see that conditioned on event E, by Lemma) > 
o’m(0“)/2. Thus the conditions of Lemmahold, by taking 0 = (17“)^0", tt = 7 r“. We thus 
get that with probability greater than 1 — 0.1/D over the randomness of Algorithm]^ there is a 
permutation matrix 11 “ such that for alH = 1 , 2 ,..., m. 


||(L“)^o“-(0“n“),|j 

^ ^ ai((17“)^Q“) e{N,S) ^ eiN,S) 

^ 2 c ^ _ 

^ (KinY a^4PYY^m(.O^Y 


where the second inequality we use the fact that (Ti((17“)^0“) = ||(17“)^0“|| < || 0 “|| < 
since 0“ is a column stochastic matrix. Therefore, 


< 

< 


||(f/“)^o“- (0“n“)|| 
||(t/“)^o“- ( 0 “n“)||F 

2^ m _ e{N,6) 

^ (^min)^ a^d{PYYf^m{0'^Y 




(15) 


We conclude the proof by applying union bound over all u GV. 


□ 
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F Putting Everything Together - Proof of Theorem 


Proof. (Of Theorem (1) We first give the recovery accuracy of observation matri¬ 
ces. The final step of recovery is = C7“0“. Note that if N is at least 


D 


C max(In ^, 
event E, we have 


In 


D 

5 ’ cr?cr?7I-3 . 


-Inf 


), then Assumption holds, hence conditioned 


< ||(7“(t/“)^ - 17“(C/“)^||||0“|| 
^ 2yfn€{N, S) 

— / T~,U,U\ 


(16) 


where the first inequality is by the fact that || A • S]l < || A|| ||f3|j, the second inequality follows from 
the fact that ||0“|| < i/m and item (1) of Lemma]^ 

Meanwhile, by Corollary]^ we have 


< 

< 


||((>“)^o“ - 0 “n“|| 

2 ^ m e{N,S) _ 


The above two facts let us conclude that provided the size of sample N is at least 
Cmax( ^a^g 2 In f, ^ In f) (where we choose C large enough), 

2 1 3 1 min 

llo“-o“n“|| 

< ||[7"((7“)^0“ - 0 “|| + ||i7“((7“)^0“ - C/“ 0 “n“|| 

^ 2v^e(W,(5) ^ m 

< mincrm(0“)^e/32 (17) 

v£V 

< e 

where the first inequality is by triangle inequality, the second inequality is by Equations 
and ( [Thl l, the third inequality follows from the choice of N, in the last inequality we use the fact 
that crm(O^) < 1. Therefore by Equation ( [T7] i and Theorem]^ 

a„(0“n“) > am(O^) - min a^(0^)y32 > a™(0“)/2 (18) 

vGV 


(2) We now provide guarantees on the accuracy of transition probabilities and initial probabilities. 
In particular, we prove ||Q“ — Q“(n“, If", || < e, the other three inequalities can be handled 

similarly. As we have already seen from Equation ( fTT] ), for all u G V, 

< 2max(|i(6“)tf, ||(n“^(0“))^f )||C>“ -d“n“|| 

< mincrm(0”)^e/16 
vGV 


where the first inequality is by Theorem]^ the second inequality uses the fact that ||(0“)^|| = 
l/(Tm(0“), ||(0“n“)t|| = l/cr„(6“n") and Equation ([T^. 


Conditioned on event E, by the choice of N, it is also true that the cooccurence tensor if f is 
such that 


II ^ u ,' 7 t { u),u pii,7r(ii),n 
1 ^ 2,24 ~ -^ 2 , 2,1 


< mincrm(0’')^e/32 
vGV 


(19) 
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Therefore, 

IIQ" - Q“(n“,n’"(“),n“)|| 

< IKPaXi”^’” “ 

+Il4“d“^’“((0“)^^ - (6“n")^t^ {0"f^)\\ 

+Il4“^^“^’“((0“^“)^^ - (o’^(“)n^(“))^t^ (0“)^^)|| 

< ||(PS,'“>’“ - 4“i-‘-*’")||n.»||0”*||= + iipS!"’’“II ■ ll( 0 “f' - (6-nr'll(max||0"tf + 

’ ■ ’ ’ vGV '>>c=.\/ 


v^V 


max IIoil'll max II6"^II +max|l6’'^f) 

v^V vGV vGV 


< 


where the first inequality is by triangle inequality, the second inequality is by the fact that 
||T(A, B,C)\\ < ||T||II A|| ||P|| lie'll, the third inequality is by Equations ( [T9] l and ( [T7] i. 

□ 


G Matrix Perturbation Lemmas 

Theorem 3 (Weyl’s Theorem). If A, E are matrices in with m > n. Then, 

\a,{A + E) - adA)\ < ||P|| 


Theorem 4 (Wedin’s Theorem). If A, E are matrices in with m > n. Let A have singular 

value decomposition: 

f UJ \ / 0 \ 

172^ U ( V2 ) = 0 S2 

\uj ) \ 0 0 J 

Let A = A + E have the singular value decomposition: 

/ c/r \ / 0 \ 

C/2^ U ( Vi ^2 ) = 0 S2 

\uj J V 0 0 y 

If there ii <5 > 0, a > 0 such that mini cri(Ei) > a + 5, maxi ai(T, 2 ) < a, then 


II sin $11 < 


Mi 

5 


where $ is the matrix of principal angles between range{Ui) and range{Ui). 
Theorems. If A, E are matrices in with m > n, let A = A + E. Then, 

||it_At||<2max(||it||2j|yit|j2)p|j 


H Compressed observation matrices produced by Spectral-Tree for eight 
ENCODE cell types 
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H3K27ac 

H3K27me3 

H3K36me3 

H3K4me1 

H3K4me2| 


0.0 0.05 0.02 0.0 o.o; 
^0.11 0.0 0.0 0.0 
0.0 0.0 0.0 0.0 0.0 
o.iio.os ^as^ o.oi 0.0 ^ 
lo.oe^l 0.0 0.0 0.0! 


H3K4me3^H 0.0 0.27 0.0 0.0 0.0 
H3K9ac^|o.020.180.03 0.010.06 
H4K20mel[o.070.040.030.08 0.0 0.06 
1 2 3 4 5 

(a) Hl-hESC 




■ 0.04 0.01 0.0 0.04 
3 0.0 0.0 0.0 0.0 o.osyisi 

3io.O 0.020.021^0.01 0.0 

0.0 0.19 
0.0 0.0 0.27 
H3K4me3^Bo!l50.02 0.0 0.0 0.16 
H3K9ac^|o.29 0.010.01 0.0 0.03 
H4K20mello.01 0.01 0.010.04 0.0 0.05 
1 2 3 4 5 6 

(c) HMEC 


I 0.0 0.0 0.11 0.0 


3 0.0 0.02 0.0 0.0 0.0 

0 0.05 0. 

d|^o 


H3K27ac| 

H3K27me3 G 

H3K36me3 0.03 0.01^^ 0.0 0.05 0.01 

H3K4melH 0-1 0-02 0.0 I 

H3K4me2^^H 0.0 0.0 0.24 0.0 
H3K4me3p.24^| 0.0 0.010.04 0.0 
H3K9acb.68^H0.01 0.0 0.07 0.0 
H4K20mello.02 0.010.04 0.010.03 0.0 
1 2 3 4 5 6 

(d) HSMM 


H3K27ac0.03^^^^.17 0.0 0.0 
H3K27me3 0.0 0.0 0.0 0.0 o.oij 
H3K36me3U|o.01 0.040.06 0.0 0.01 
H3K4me10.02^^^^Ho.01 0.0 
H3K4me20.01^^^K!^0.01 0.0 
H3K4me30.01^Ba^0.04 0.0 0.0 
H3K9ac0.01^Ba6^0.03 0.0 0.0 
H4K20mel0.01 0.0 0.010.04 0.01 0.0 
1 2 3 4 5 6 

(e) HUVEC 


|-l3K27ac^^^l0.020.05 0.0 0.16 
H3K27me3 0.05 0.0 0.21 0.0 0.01 0.0 
H3K36me3 0.15 0.010.02 E o.oio.osi 
H3K4mel| 

H3K4me2 
H3K4me3 
H3K9ac 
H4K20me1 



12 3 4 

(f) K562 


H3K27me3 |6| 0.0 0.0 0.0 0.0 0.0 


H3K36me3 
H3K4mel0.020.23| 
H3K4me20.04| 
H3K4me30.02 
H3K9ac0.01 


0.0 0.01 0.010.03 0.29 0.0- 

^^Ho.02 

Ho.3 0.0 0.0 
!^0.02 0.0 0.0 
.630.03 0.0 0.0 
H4K20mel[o.010.01 0.0 0.010.04 0.0 
1 2 3 4 5 6“ 

(g) NHEK 


]1^3^ 0.0 


H3K27ac^|0.06 0.0 0.01^ 
H3K27me3 0.0 O.Oll^O.O 0.0 0.01 
H3K36me3 0.0 0.0 0.0 kfllo.03 0.0 
H3K4me10.070.19 0.010.03|o.5fl 0.0 
H3K4me2^^Ho.02 0.01&4i 0.0 

H3K4me3^E4i 0.0 0.010.02 0.0 

H3K9ac^Bo!M 0.0 0.0 0.03 0.0 
H4K20mel[o.O 0.0 0.010.02 0.01 0.0 
1 2 3 4 5 6 

(h) NHLE 


Figure 3: The compressed observation matrices estimated by Spectral-Tree for all eight ENCODE 
cell types studied, other than GM12878 which is presented in the main manuscript. 
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